Aim: Getting started with statistical approaches and bioinformatics tools commonly used to analyze microarray experiments and to select genes according to their expression profiles.
This practical is divided in 3 sections:
The first microarray datasets were collected from the publication of Guida et al.2011. The authors used high throughput technologies (microarrays and high throughput sequencing) to determine the transcriptional profile of the pathogenic yeast Candida parapsilosis growing in several conditions including media, temperature and oxygen concentrations. You will use the datasets related to the study of the hypoxic (low oxygen) response in C. parapsilosis.
The experiments were performed comparing one cell culture incubated at atmospheric oxygen conditions (call normoxic and labelled using Cy3 dye) and another one incubated in 1% O2 (call hypoxic and labelled using Cy5 dye).
Input : A GPR file with detailed information for each spot on the slide (gene name, Cy5 and Cy3 intensity values, background intensities and other statistics).
library(marray)
The R package Marray offers several functions to:
# Read the GPR file using the marray package function read.GenePix
rawdata <- read.GenePix(fnames="dataFile1_normAnalysis.gpr",
path= "/shared/projects/2420_ens_hts/data/microarrays/data")
## Reading ... /shared/projects/2420_ens_hts/data/microarrays/data/dataFile1_normAnalysis.gpr
Note : This function reads a GPR file and creates objects of class marrayRaw. In these objects, you can find, for instance, vectors with intensity values (rawdata@maRf or rawdata@maGf). These vectors can be manipulated using classical R functions like “summary()”, “hist()”, etc.
Take a few minutes to better understand the structure of the R object “marrayRaw”. Start for instance to manipulate the vectors with foreground signals (rawdata@maRf or rawdata@maGf).
# f is for foreground
# Intensity values in red/hypoxic channel
head(rawdata@maRf)
## /shared/projects/2420_ens_hts/data/microarrays/data/dataFile1_normAnalysis.gpr
## [1,] 992
## [2,] 1907
## [3,] 559
## [4,] 645
## [5,] 32939
## [6,] 681
# Intensity values in green/normoxic channel
head(rawdata@maGf)
## /shared/projects/2420_ens_hts/data/microarrays/data/dataFile1_normAnalysis.gpr
## [1,] 2561
## [2,] 2585
## [3,] 1588
## [4,] 1604
## [5,] 43755
## [6,] 1732
Visualisation of foreground signal
# Red/hypoxic signals
image(rawdata,
xvar = "maRf",
main = "Hypoxic signal (with flags)")
## [1] FALSE
## NULL
# Green/normoxic signals
image(rawdata,
xvar = "maGf",
main = "Normoxic signal (with flags)")
## [1] FALSE
## NULL
Visualize background signals in Red/Hypoxic and Green/Normoxic channels. Try to interpret the obtained results. How is the quality of the experiment?
# b is for background
# Red/Hypoxic channel background signals
image(rawdata,
xvar = "maRb",
main = "Hypoxic background (with flags)")
## [1] FALSE
## NULL
# Green/Normoxic channel background signals
image(rawdata,
xvar = "maGb",
main = "Normoxic background (with flags)")
## [1] FALSE
## NULL
Each spot is associated with a flag value reporting some quality information. Flag values are the following:
# Distribution of spots by flag values
table(rawdata@maW)
##
## -100 -75 -50 0
## 103 192 5922 9335
Negatively flagged spots will be eliminated from further analyses. For that, intensity values in foreground and background signals have to be replaced with the R symbol “NA” (missing values, “Not Available”).
# You work on a copy of the rawdata
rawdataWithoutFlags <- rawdata
#Background values to NA
rawdataWithoutFlags@maRb[rawdataWithoutFlags@maW < 0] = NA
rawdataWithoutFlags@maGb[rawdataWithoutFlags@maW < 0] = NA
#Signal values to NA
rawdataWithoutFlags@maRf[rawdataWithoutFlags@maW < 0] = NA
rawdataWithoutFlags@maGf[rawdataWithoutFlags@maW < 0] = NA
An intuitive approach for background correction consists in subtracting background intensity values (rawdata@maRb and rawdata@maGb) from global foreground intensities (rawdata@maRf and rawdata@maGf). Nevertheless this method is debatable because for low intensities it adds more noise to the data and creates overestimated log(Ratio) values.
#Draw a MA plot with background subtracted from foreground values
plot(rawdataWithoutFlags,legend.func = NULL, main = "MA plot before normalization (background subtraction)")
#Replace all background by 0
rawdataWithoutFlags@maGb[] = 0
rawdataWithoutFlags@maRb[] = 0
#Draw a MA plot with foreground values only
plot(rawdataWithoutFlags,legend.func = NULL, main = "MA plot before normalization (no background subtraction)")
The following analyses will be performed with no background subtraction.
plot(rawdataWithoutFlags, main = "MA plot before normalization")
boxplot(rawdataWithoutFlags, main = "Boxplot before normalization")
rawdataWithoutFlagsNorm <- maNorm(rawdataWithoutFlags, norm = "median", echo = T)
## Normalization method: median.
## Normalizing array 1.
rawdataWithoutFlagsNorm2 <- maNorm(rawdataWithoutFlags, norm = "loess", echo = T)
## Normalization method: loess.
## Normalizing array 1.
rawdataWithoutFlagsNorm3 <- maNorm(rawdataWithoutFlags, norm = "printTipLoess", echo = T)
## Normalization method: printTipLoess.
## Normalizing array 1.
Several plots allow for comparison of the normalization methods
plot(rawdataWithoutFlagsNorm, legend.func = NULL, main = "norm = Median")
plot(rawdataWithoutFlagsNorm2, legend.func = NULL, main = "norm = Loess")
plot(rawdataWithoutFlagsNorm3, legend.func = NULL, main = "norm = printTipLoess")
boxplot(rawdataWithoutFlagsNorm, main = "norm = Median")
boxplot(rawdataWithoutFlagsNorm2, main = "norm = Loess")
boxplot(rawdataWithoutFlagsNorm3, main = "norm = printTipLoess")
plot(density(maM(rawdataWithoutFlagsNorm2),na.rm = T),
lwd = 2, col = 2, main = "Distribution of log(Ratio)")
lines(density(maM(rawdataWithoutFlags), na.rm = T), lwd = 2)
abline(v = 0)
legend(x= 0.5, y= 1.2,c("Before normalization","After normalization with loess"), fill = c(1,2))
In their article (Guida et al., 2011), the authors repeated the experiment 4 times for normoxic condition (with O2) and 4 times for hypoxic conditions (without O2). Expressions of genes between the two conditions were compared using microarrays (Ratio = hypoxia / normoxia).
You will perform the DE analysis using the limma package
library(limma)
Input: A text file with four different biological replicates (after normalization).
dataFile <- "/shared/projects/2420_ens_hts/data/microarrays/data/dataFile_diffAnalysis.txt"
data <- as.matrix(read.table(dataFile, row.names = 1, header = T))
# Retrieve some information from the data table
dim(data)
## [1] 5526 4
data[1:10,1:4]
## logVal1 logVal2 logVal3 logVal4
## CPAR2_201050 -0.265265616 -0.130465012 0.008997103 -0.06624613
## CPAR2_101960 -0.843512598 -0.608422137 -0.103000282 -0.45358870
## CPAR2_101290 0.056414092 0.000296908 -0.068354697 0.05983511
## CPAR2_405520 0.464588136 0.509999239 0.284349940 0.44530769
## CPAR2_201590 -0.230207648 -0.176294382 -0.265324830 -0.24833664
## CPAR2_103750 -0.194992750 -0.186335163 0.191242260 -0.57185971
## CPAR2_100170 -0.132982234 -0.191465175 -0.126354218 0.00331530
## CPAR2_202790 0.973402061 0.853915233 0.808972712 0.74969076
## CPAR2_301860 -0.008917937 0.018171339 -0.021780941 0.16899955
## CPAR2_106430 -1.598703129 -1.508676852 -0.642865880 -0.87494246
summary(data)
## logVal1 logVal2 logVal3 logVal4
## Min. :-3.20789 Min. :-2.902426 Min. :-3.634918 Min. :-3.854502
## 1st Qu.:-0.33699 1st Qu.:-0.318338 1st Qu.:-0.252692 1st Qu.:-0.278030
## Median :-0.01331 Median :-0.002482 Median : 0.002798 Median :-0.009852
## Mean : 0.02278 Mean : 0.025141 Mean : 0.075240 Mean : 0.014181
## 3rd Qu.: 0.30528 3rd Qu.: 0.309444 3rd Qu.: 0.322220 3rd Qu.: 0.282426
## Max. : 6.65491 Max. : 6.422750 Max. : 6.559013 Max. : 6.213929
# Linear model estimation
fit <- lmFit(data)
# Bayesian statistics
limma.res <- eBayes(fit)
# Overview of the most differentially expressed genes
head(topTable(limma.res))
## logFC AveExpr t P.Value adj.P.Val B
## CPAR2_404850 6.462651 6.462651 71.45643 3.788865e-10 2.093727e-06 13.59386
## CPAR2_503990 5.168192 5.168192 61.93992 9.047930e-10 2.499943e-06 13.02649
## CPAR2_502580 3.504953 3.504953 50.62005 3.091002e-09 4.333583e-06 12.10863
## CPAR2_807620 3.614666 3.614666 50.49767 3.136868e-09 4.333583e-06 12.09684
## CPAR2_401230 3.328905 3.328905 45.27100 6.097772e-09 5.982882e-06 11.54690
## CPAG_00607 3.396506 3.396506 43.79661 7.457963e-09 5.982882e-06 11.37366
# Retrieve result table for all genes
allgenes.limma <- topTable(limma.res, number = nrow(data))
You will draw a Volcano plot using the EnhanceVolcano Bioconductor package
library(EnhancedVolcano)
logFCthreshold <- 2
adjPVthreshold <- 0.01
EnhancedVolcano(allgenes.limma,
lab = rownames(allgenes.limma),
x = 'logFC',
y = 'adj.P.Val',
axisLabSize = 12,
ylim = c(0,max(-log10(allgenes.limma[[5]]))),
caption = paste0(nrow(allgenes.limma), " total genes"),
captionLabSize = 12,
FCcutoff = logFCthreshold,
pCutoff = adjPVthreshold,
title = 'Hypoxia VS Normoxia',
subtitle = 'Volcano plot',
pointSize = 2.0,
labSize = 4.0,
legendLabSize = 12,
legendIconSize = 2.0,
drawConnectors = TRUE)
## Warning: ggrepel: 58 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
You can use the option “selectLab” to display only the genes you are interested in on the graph passing a vector of gene names.
EnhancedVolcano(allgenes.limma,
lab = rownames(allgenes.limma),
x = 'logFC',
y = 'adj.P.Val',
selectLab = c('CPAR2_404850','CPAR2_602990', 'CPAR2_802720'),
axisLabSize = 12,
ylim = c(0,max(-log10(allgenes.limma[[5]]))),
caption = paste0(nrow(allgenes.limma), " total genes"),
captionLabSize = 12,
FCcutoff = logFCthreshold,
pCutoff = adjPVthreshold,
title = 'Hypoxia VS Normoxia',
subtitle = 'Volcano plot',
pointSize = 2.0,
labSize = 4.0,
legendLabSize = 12,
legendIconSize = 2.0,
drawConnectors = TRUE)
# Filter differentially expressed genes based on previous thresholds
siggenes.limma <- allgenes.limma[allgenes.limma[,5] < adjPVthreshold & abs(allgenes.limma[,1]) > logFCthreshold,]
paste(dim(siggenes.limma[siggenes.limma[,1] > 0,])[1], "upregulated genes in Hypoxia")
## [1] "78 upregulated genes in Hypoxia"
paste(dim(siggenes.limma[siggenes.limma[,1] < 0,])[1], "upregulated genes in Normoxia")
## [1] "11 upregulated genes in Normoxia"
# Export DE gene table into your home directory:
write.table(siggenes.limma[siggenes.limma[,1] > 0,],
row.names = T, quote = F, sep = ";",
file = "~/hypoxia_signif_genes.csv")
write.table(siggenes.limma[siggenes.limma[,1] < 0,],
row.names = T, quote = F, sep = ";",
file = "~/normoxia_signif_genes.csv")
print(siggenes.limma[siggenes.limma[,1] < 0,])
## logFC AveExpr t P.Value adj.P.Val B
## CPAR2_602990 -2.857353 -2.857353 -40.921642 1.126926e-08 6.227391e-06 11.009444
## CPAR2_603010 -2.386064 -2.386064 -36.666590 2.195952e-08 7.238432e-06 10.396967
## CPAR2_700940 -2.716475 -2.716475 -33.338684 3.912918e-08 9.828540e-06 9.845186
## CPAR2_808020 -2.107592 -2.107592 -26.385586 1.614952e-07 2.059145e-05 8.421440
## CPAR2_108370 -3.399933 -3.399933 -21.054322 6.312015e-07 4.166148e-05 6.981216
## CPAR2_207540 -2.456284 -2.456284 -21.027727 6.360268e-07 4.166148e-05 6.973018
## CPAR2_300630 -2.913243 -2.913243 -20.849107 6.695692e-07 4.302371e-05 6.917658
## CPAR2_603050 -2.369657 -2.369657 -18.816323 1.241135e-06 6.178842e-05 6.248014
## CPAR2_802720 -2.030915 -2.030915 -13.789453 7.945638e-06 2.080929e-04 4.191489
## CPAR2_800530 -2.756817 -2.756817 -11.269853 2.610170e-05 4.397499e-04 2.852306
## CPAR2_802670 -2.383530 -2.383530 -8.602067 1.243087e-04 1.152567e-03 1.082295
Several tools are available online to evaluate the biological relevance of the gene sets you select after the differential analysis. For example you can you use the GoTermFinder tool dedicated to Candida yeast species to retrieve functional annotation. Be careful to choose the genome of Candida parapsilosis in the species selection dialog box. You can also obtain information on a specific gene using the Candida Genome Database.
#Date
format(Sys.time(), "%d %B, %Y, %H,%M")
## [1] "20 August, 2024, 13,53"
#Packages used
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.4.1/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Paris
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] EnhancedVolcano_1.22.0 ggrepel_0.9.5 ggplot2_3.5.1
## [4] marray_1.82.0 limma_3.60.3
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 jsonlite_1.8.8 dplyr_1.1.4 compiler_4.4.1
## [5] highr_0.11 tidyselect_1.2.1 Rcpp_1.0.12 jquerylib_0.1.4
## [9] scales_1.3.0 yaml_2.3.9 fastmap_1.2.0 statmod_1.5.0
## [13] R6_2.5.1 labeling_0.4.3 generics_0.1.3 knitr_1.48
## [17] tibble_3.2.1 munsell_0.5.1 bslib_0.7.0 pillar_1.9.0
## [21] rlang_1.1.4 utf8_1.2.4 cachem_1.1.0 xfun_0.45
## [25] sass_0.4.9 cli_3.6.3 withr_3.0.0 magrittr_2.0.3
## [29] digest_0.6.36 grid_4.4.1 rstudioapi_0.16.0 lifecycle_1.0.4
## [33] vctrs_0.6.5 evaluate_0.24.0 glue_1.7.0 farver_2.1.2
## [37] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.27 tools_4.4.1
## [41] pkgconfig_2.0.3 htmltools_0.5.8.1